{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ur8xi4C7S06n"
      },
      "outputs": [],
      "source": [
        "# Copyright 2025 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JAPoU8Sm5E6e"
      },
      "source": [
        "# Spatial understanding with Gemini\n",
        "\n",
        "<table align=\"left\">\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/spatial-understanding/spatial_understanding.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fspatial-understanding%2Fspatial_understanding.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/spatial-understanding/spatial_understanding.ipynb\">\n",
        "      <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/spatial-understanding/spatial_understanding.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://raw.githubusercontent.com/primer/octicons/refs/heads/main/icons/mark-github-24.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "</table>\n",
        "\n",
        "<div style=\"clear: both;\"></div>\n",
        "\n",
        "<b>Share to:</b>\n",
        "\n",
        "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/spatial-understanding/spatial_understanding.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/spatial-understanding/spatial_understanding.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/spatial-understanding/spatial_understanding.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/spatial-understanding/spatial_understanding.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/spatial-understanding/spatial_understanding.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
        "</a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "84f0f73a0f76"
      },
      "source": [
        "| Authors |\n",
        "| --- |\n",
        "| [Guillaume Vernade](https://github.com/Giom-V) |\n",
        "| [Holt Skinner](https://github.com/holtskinner) |"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tvgnzT1CKxrO"
      },
      "source": [
        "## Overview\n",
        "\n",
        "This notebook introduces object detection and spatial understanding with the Gemini API in Vertex AI.\n",
        "\n",
        "\n",
        "**YouTube Video: Building with Gemini 2.0: Spatial understanding**\n",
        "\n",
        "<a href=\"https://www.youtube.com/watch?v=-XmoDzDMqj4\" target=\"_blank\">\n",
        "  <img src=\"https://img.youtube.com/vi/-XmoDzDMqj4/maxresdefault.jpg\" alt=\"Building with Gemini 2.0: Spatial understanding\" width=\"500\">\n",
        "</a>\n",
        "\n",
        "\n",
        "You'll learn how to use Gemini to perform object detection like this:\n",
        "\n",
        "<img src=\"https://storage.googleapis.com/generativeai-downloads/images/cupcakes_with_bbox.png\" alt=\"Cupcakes with Bounding box\" width=\"500\">\n",
        "\n",
        "The examples cover object detection for:\n",
        "\n",
        "* overlaying information on an image\n",
        "* searching within an image\n",
        "* translating and understanding things in multiple languages\n",
        "* using Gemini's reasoning abilities\n",
        "\n",
        "**Note**\n",
        "\n",
        "There's no \"magic prompt\", so feel free to experiment. You can pick one of the sample prompts from each dropdown or write your own, and you can also try uploading your own images.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "61RBz8LLbxCR"
      },
      "source": [
        "## Get started"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "No17Cw5hgx12"
      },
      "source": [
        "### Install Google Gen AI SDK\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tFy3H3aPgx12"
      },
      "outputs": [],
      "source": [
        "%pip install --upgrade --quiet google-genai pillow"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dmWOrTJ3gx13"
      },
      "source": [
        "### Authenticate your notebook environment (Colab only)\n",
        "\n",
        "If you're running this notebook on Google Colab, run the cell below to authenticate your environment."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NyKGtVQjgx13"
      },
      "outputs": [],
      "source": [
        "import sys\n",
        "\n",
        "if \"google.colab\" in sys.modules:\n",
        "    from google.colab import auth\n",
        "\n",
        "    auth.authenticate_user()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DF4l8DTdWgPY"
      },
      "source": [
        "### Set Google Cloud project information and create client\n",
        "\n",
        "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n",
        "\n",
        "Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Nqwi-5ufWp_B"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "\n",
        "PROJECT_ID = \"[your-project-id]\"  # @param {type: \"string\", placeholder: \"[your-project-id]\", isTemplate: true}\n",
        "if not PROJECT_ID or PROJECT_ID == \"[your-project-id]\":\n",
        "    PROJECT_ID = str(os.environ.get(\"GOOGLE_CLOUD_PROJECT\"))\n",
        "\n",
        "LOCATION = \"global\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "671b19874724"
      },
      "outputs": [],
      "source": [
        "from google import genai\n",
        "\n",
        "client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5303c05f7aa6"
      },
      "source": [
        "### Import libraries"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6fc324893334"
      },
      "outputs": [],
      "source": [
        "from IPython.display import display\n",
        "from PIL import Image, ImageColor, ImageDraw, ImageFont\n",
        "from google.genai.types import GenerateContentConfig, Part, SafetySetting\n",
        "from pydantic import BaseModel\n",
        "import requests"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "e43229f3ad4f"
      },
      "source": [
        "### Load model\n",
        "\n",
        "Spatial understanding works best with the [Gemini 2.5 Flash model](https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2).\n",
        "\n",
        "For more information about all AI models and APIs on Vertex AI, see [Google Models](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models) and [Model Garden](https://cloud.google.com/vertex-ai/generative-ai/docs/model-garden/explore-models)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cf93d5f0ce00"
      },
      "outputs": [],
      "source": [
        "MODEL_ID = \"gemini-2.5-flash\"  # @param {type:\"string\", isTemplate: true}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f6f7dcbe8a48"
      },
      "source": [
        "We'll set the configuration to include a system instruction, safety settings, and a Pydantic class that serves as the response schema for controlled generation.\n",
        "\n",
        "The system instruction mainly keeps the prompts short, because the output format doesn't have to be repeated each time. It also tells the model how to handle similar objects (give each one a unique label), which leaves it room to be creative.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bd3142f4b7ba"
      },
      "outputs": [],
      "source": [
        "class BoundingBox(BaseModel):\n",
        "    \"\"\"\n",
        "    Represents a bounding box with its 2D coordinates and associated label.\n",
        "\n",
        "    Attributes:\n",
        "        box_2d (list[int]): A list of integers representing the 2D coordinates of the bounding box,\n",
        "                            in the format [y_min, x_min, y_max, x_max], normalized to a 0-1000 scale.\n",
        "        label (str): A string representing the label or class associated with the object within the bounding box.\n",
        "    \"\"\"\n",
        "\n",
        "    box_2d: list[int]\n",
        "    label: str\n",
        "\n",
        "\n",
        "config = GenerateContentConfig(\n",
        "    system_instruction=\"\"\"Return bounding boxes as an array with labels. Never return masks. Limit to 25 objects.\n",
        "    If an object is present multiple times, give each object a unique label according to its distinct characteristics (colors, size, position, etc.).\"\"\",\n",
        "    temperature=0.5,\n",
        "    safety_settings=[\n",
        "        SafetySetting(\n",
        "            category=\"HARM_CATEGORY_DANGEROUS_CONTENT\",\n",
        "            threshold=\"BLOCK_ONLY_HIGH\",\n",
        "        ),\n",
        "    ],\n",
        "    response_mime_type=\"application/json\",\n",
        "    response_schema=list[BoundingBox],\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "075fdc6fcab4"
      },
      "source": [
        "### Helper functions\n",
        "\n",
        "Define a helper function that draws the bounding boxes and their labels on an image."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "362ff535cfbc"
      },
      "outputs": [],
      "source": [
        "def plot_bounding_boxes(image_uri: str, bounding_boxes: list[BoundingBox]) -> None:\n",
        "    \"\"\"\n",
        "    Plots bounding boxes on an image with labels, using PIL and normalized coordinates.\n",
        "\n",
        "    Args:\n",
        "        image_uri: The URI of the image file.\n",
        "        bounding_boxes: A list of BoundingBox objects. Each box's coordinates are in\n",
        "                        [y_min, x_min, y_max, x_max] format, normalized to a 0-1000 scale.\n",
        "    \"\"\"\n",
        "    with Image.open(requests.get(image_uri, stream=True, timeout=10).raw) as im:\n",
        "        width, height = im.size\n",
        "        draw = ImageDraw.Draw(im)\n",
        "        colors = list(ImageColor.colormap.keys())\n",
        "\n",
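        "        # Scale the label font with the image (the `size` argument needs Pillow >= 10.1);\n",
        "        # keep a minimum size so small images don't request a zero-sized font.\n",
        "        font = ImageFont.load_default(size=max(10, int(min(width, height) / 100)))\n",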
        "\n",
        "        for i, bbox in enumerate(bounding_boxes):\n",
        "            # Scale the 0-1000 normalized coordinates to pixel coordinates\n",
        "            abs_y_min = int(bbox.box_2d[0] / 1000 * height)\n",
        "            abs_x_min = int(bbox.box_2d[1] / 1000 * width)\n",
        "            abs_y_max = int(bbox.box_2d[2] / 1000 * height)\n",
        "            abs_x_max = int(bbox.box_2d[3] / 1000 * width)\n",
        "\n",
        "            color = colors[i % len(colors)]\n",
        "\n",
        "            # Draw the rectangle using the correct (x, y) pairs\n",
        "            draw.rectangle(\n",
        "                ((abs_x_min, abs_y_min), (abs_x_max, abs_y_max)),\n",
        "                outline=color,\n",
        "                width=4,\n",
        "            )\n",
        "            if bbox.label:\n",
        "                # Position the text at the top-left corner of the box\n",
        "                draw.text(\n",
        "                    (abs_x_min + 8, abs_y_min + 6),\n",
        "                    bbox.label,\n",
        "                    fill=color,\n",
        "                    font=font,\n",
        "                )\n",
        "\n",
        "        display(im)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "63dc6fe645cb"
      },
      "source": [
        "### Overlay information\n",
        "\n",
        "The first example uses this image of cupcakes.\n",
        "\n",
        "<img src=\"https://storage.googleapis.com/generativeai-downloads/images/Cupcakes.jpg\" alt=\"Cupcakes\" width=\"500\">\n",
        "\n",
        "Use a simple prompt that asks for a bounding box around each cupcake, labeled with a description of its topping."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fc5341343e32"
      },
      "outputs": [],
      "source": [
        "image_uri = \"https://storage.googleapis.com/generativeai-downloads/images/Cupcakes.jpg\"\n",
        "prompt = \"Detect the 2d bounding boxes of the cupcakes (with `label` as topping description)\"  # @param {type:\"string\"}\n",
        "\n",
        "response = client.models.generate_content(\n",
        "    model=MODEL_ID,\n",
        "    contents=[\n",
        "        prompt,\n",
        "        Part.from_uri(\n",
        "            file_uri=image_uri,\n",
        "            mime_type=\"image/jpeg\",\n",
        "        ),\n",
        "    ],\n",
        "    config=config,\n",
        ")\n",
        "\n",
        "print(response.text)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "720c52d0a4ac"
      },
      "source": [
        "Next, we'll plot the bounding boxes on the image."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "06cac3d11f2f"
      },
      "outputs": [],
      "source": [
        "plot_bounding_boxes(image_uri, response.parsed)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0cd84765aebd"
      },
      "source": [
        "We can see that Gemini created bounding boxes with labels for the different cupcakes."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "500d79bd18d8"
      },
      "source": [
        "### Search within an image\n",
        "\n",
        "Next, try something harder: search another image for specific objects."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4ae41be5e764"
      },
      "outputs": [],
      "source": [
        "image_uri = \"https://storage.googleapis.com/generativeai-downloads/images/socks.jpg\"\n",
        "prompt = \"Show me the positions of the socks with a face. Label according to position in the image.\"  # @param [\"Detect all rainbow socks\", \"Find all socks and label them with emojis \", \"Show me the positions of the socks with a face. Label according to position in the image.\", \"Find the sock that goes with the one at the top\"] {\"allow-input\":true}\n",
        "\n",
        "response = client.models.generate_content(\n",
        "    model=MODEL_ID,\n",
        "    contents=[\n",
        "        prompt,\n",
        "        Part.from_uri(\n",
        "            file_uri=image_uri,\n",
        "            mime_type=\"image/jpeg\",\n",
        "        ),\n",
        "    ],\n",
        "    config=config,\n",
        ")\n",
        "\n",
        "print(response.text)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ea5c91502156"
      },
      "outputs": [],
      "source": [
        "plot_bounding_boxes(image_uri, response.parsed)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "13efd076e581"
      },
      "source": [
        "### Use Gemini reasoning capabilities\n",
        "\n",
        "The model can also reason about the image. You can ask it about the positions of items or their utility, or, as in this example, ask it to find the shadow of a specific item."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bcfed0b338bd"
      },
      "outputs": [],
      "source": [
        "image_uri = \"https://storage.googleapis.com/generativeai-downloads/images/origamis.jpg\"\n",
        "prompt = \"Draw a square around the fox shadow\"  # @param [\"Find the two origami animals.\", \"Where are the origamis' shadows?\", \"Draw a square around the fox shadow\"] {\"allow-input\":true}\n",
        "\n",
        "response = client.models.generate_content(\n",
        "    model=MODEL_ID,\n",
        "    contents=[\n",
        "        prompt,\n",
        "        Part.from_uri(\n",
        "            file_uri=image_uri,\n",
        "            mime_type=\"image/jpeg\",\n",
        "        ),\n",
        "    ],\n",
        "    config=config,\n",
        ")\n",
        "\n",
        "print(response.text)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8aed398846b6"
      },
      "outputs": [],
      "source": [
        "plot_bounding_boxes(image_uri, response.parsed)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f4feea1b8606"
      },
      "source": [
        "You can also use Gemini's knowledge to enrich the returned labels. In this example, Gemini gives you advice on how to clean up a spill.\n",
        "\n",
        "Resizing the image before sending it (for example, to 1024px) can help the model get the bigger picture and give better advice. There's no clear rule about when to do it, and the cell below sends the image at its original size, so experiment and find what works best for you."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8933942bd3b2"
      },
      "outputs": [],
      "source": [
        "image_uri = \"https://storage.googleapis.com/generativeai-downloads/images/spill.jpg\"\n",
        "prompt = \"Tell me how to clean my table with an explanation as label\"  # @param [\"Show me where my coffee was spilled.\", \"Tell me how to clean my table with an explanation as label\"] {\"allow-input\":true}\n",
        "\n",
        "response = client.models.generate_content(\n",
        "    model=MODEL_ID,\n",
        "    contents=[\n",
        "        prompt,\n",
        "        Part.from_uri(\n",
        "            file_uri=image_uri,\n",
        "            mime_type=\"image/jpeg\",\n",
        "        ),\n",
        "    ],\n",
        "    config=config,\n",
        ")\n",
        "\n",
        "print(response.text)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "a02617bd6278"
      },
      "outputs": [],
      "source": [
        "plot_bounding_boxes(image_uri, response.parsed)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "994c73535ba5"
      },
      "source": [
        "### Try with more images\n",
        "\n",
        "Here are some more sample images to try prompting with Gemini.\n",
        "\n",
        "- https://storage.googleapis.com/generativeai-downloads/images/vegetables.jpg\n",
        "- https://storage.googleapis.com/generativeai-downloads/images/Japanese_Bento.png\n",
        "- https://storage.googleapis.com/generativeai-downloads/images/fruits.jpg\n",
        "- https://storage.googleapis.com/generativeai-downloads/images/cat.jpg\n",
        "- https://storage.googleapis.com/generativeai-downloads/images/pumpkins.jpg\n",
        "- https://storage.googleapis.com/generativeai-downloads/images/breakfast.jpg\n",
        "- https://storage.googleapis.com/generativeai-downloads/images/bookshelf.jpg"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "spatial_understanding.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
